Method and apparatus for efficient multidimensional fast fourier transforms

ABSTRACT

A method and apparatus for calculating multidimensional Fast Fourier Transforms (FFTs) efficiently without transpose data flow and with in-place computations. If higher throughput computations are desired, computations are done in pipelined stages with parallel computing devices. A wide range of trade-offs can be made between the computation speed and the hardware complexity. This is based on an extension of Cooley-Tuckey algorithm to n-dimensional data. A mathematical derivation of the algorithm has been provided. This invention makes it possible to perform n-dimensional FFTs, n&gt;1 without relying on one-dimensional FFT computations as in the prior art.

BACKGROUND OF THE INVENTION 1. Field of Invention

This invention relates to processing of numerical data. Morespecifically, this invention relates to the algorithm of computingFourier transform for data in a multidimensional space.

2. Description of the Related Art

The Fourier transform computations for one and higher dimensional dataare well known. In practical applications, the computation of theFourier transform is mostly dependent on the fast algorithm known asFast Fourier Transform (FFT). So far only one-dimensional FFT algorithmhas been used even for n-dimensional Fourier transform computationssince n-dimensional Fourier transform is separable, i.e., n-dimensionalFourier transform computation can be done by a sequence ofone-dimensional FFTs on each dimension. This prior art for computing ann-dimensional FFT incurs a bottleneck in data flow and creates overheadwhen computing the transform for a large data set due to transposeoperations required between successive one-dimensional FFT computations.The overhead increases exponentially as n increases as well as data sizeincreases. Much research has been done to reduce the overhead.

SUMMARY OF THE INVENTION

This invention discloses a novel method and apparatus for performingn-dimensional FFT computations. Said method is based on an extension ofthe well-known Cooley-Tukey 1-D FFT algorithm to n-D FFT which makes itpossible to compute an n-dimensional FFT directly from the input datawithout computing a sequence of 1-D FFTs. In the prior art,n-dimensional FFTs are computed by performing a sequence of 1-D FFTs. Byavoiding the use of 1-D FFT computations, said current invention removesthe transpose operations which causes computation bottleneck in n-D FFT,n>1. Furthermore, said current invention makes it possible to performcomputations with greatest parallelism for highest throughput as well aswith in-place computations requiring smallest amount of memory. In thesaid invention, the computation dependencies are reduced to within basiccomputation blocks called n-D butterflies or n-D quad-flies or n-Dhybrid flies, making it possible to achieve the maximum parallelism. Theillustrations are given only for 2-D cases for simpler graphicalrepresentations, but it generalizes to n-dimensional FFTs as explainedlater.

In a first aspect of the invention for a 2-D FFT embodiment,2-dimensional basic computation blocks are shown in FIG. 1 . FIG. 1 -b,c, d represent all the equivalent 2-D butterflies where 1-D butterflyblock notation in FIG. 1 -a has been used in FIG. 1 -c and FIG. 1 -d.

In another aspect of the invention for a 2-D FFT embodiment,2-dimensional basic computation blocks are shown in FIG. 2 . FIG. 2 -b,c, d represent all the equivalent 2-D quad-flies where 1-D quad-flyblock notation in FIG. 2 -a has been used in FIG. 2 -c and FIG. 2 -d.

In a further aspect of the invention for a 2-D FFT embodiment,2-dimensional computation blocks are shown in FIG. 3 . FIG. 3 -a andFIG. 3 -b are equivalent 2-D hybrid-flies wherein 1-D butterfly and 1-Dquad-fly block notations are used in FIG. 3 -b.

In a further aspect of the invention for a 2-D FFT embodiment, anexemplary implementation with a minimum memory requirement is shown inFIG. 4 -a, wherein the input memories are used repeatedly throughout thecomputation demonstrating in-place computation capability of theinvention.

In a further aspect of the invention for a 2-D FFT embodiment, the datain the memory are processed by a basic computation block exclusively andno dependencies exist amongst basic computation blocks at each stage asshown in FIG. 4 -b for an example of 256×256 2-D FFT.

In a further aspect of the invention, the computation does not havetranspose operations.

In a further aspect of the invention, n-D FFT is implemented in stageswhere the number of stages is determined by the maximum transform sizeamongst the transform sizes in all dimensions.

In a further aspect of the invention, each stage in an n-D FFTimplementation has n-D butterflies or n-D quad-flies or n-D hybridfiles.

BRIEF DESCRIPTIONS OF THE DRAWINGS

FIG. 1 depicts a construction of a 2-D butterfly;

FIG. 1 -a is the 1-D butterfly where small bubble represents amultiplication by −1.

FIG. 1 -b represents the 1-D butterfly in FIG. 1 -a as a butterflysymbol.

FIG. 1 -c represents a 2-D butterfly with a 2×2 input data.

FIG. 1 -d represents the same 2-D butterfly as FIG. 1 -c using the 1-Dbutterfly symbol in FIG. 1 -b.

FIG. 1 -e represents the same 2-D butterfly as FIG. 1 -d drawn in aperspective.

FIG. 2 depicts construction of a 2-D quad-fly;

FIG. 2 -a is the 1-D quad-fly component where small bubbles representmultiplications with coefficients above.

FIG. 2 -b represents the 1-D quad-fly in FIG. 2 -a as a quad-fly symbol

FIG. 2 -c represents a 2-D quad-fly with a 4×4 input data.

FIG. 2 -d represents the same 2-D quad-fly as in FIG. 2 -c drawn in aperspective.

FIG. 3 depicts construction of 2-D hybrid-flies;

FIG. 3 -a depicts construction of a 2-D hybrid-fly made of 1-Dbutterflies and 1-D quad-flies.

FIG. 3 -b depicts the same hybrid-fly as FIG. 3 -a, but using the blocksymbols.

FIG. 4 depicts an exemplary 2-D FFT computations;

FIG. 4 -a depicts a block diagram of an exemplary 2-D FFT computation.

FIG. 4 -b depicts data in the buffer as an exemplary 2-D FFT computationprogresses.

FIG. 5 depicts an exemplary pipelined and parallel n-dimensional FFTcomputational blocks.

FIG. 6 depicts a 3-D butterfly.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The bottleneck in 2-D and higher dimensional FFT computations in theprior art stems from the transpose operations. The transpose operationsare required in the prior art because 1-D FFT computations are used for2-D and higher dimensional discrete Fourier transforms. This inventiondiscloses a novel 2-D and higher dimensional FFT computation methodswithout using 1-D FFT. To this end we generalize the Cooley-Tukeyalgorithm in 1-D FFT to n-D FFT. We first follow the similar steps ofCooley-Tukey algorithm development in the 1-D case, i.e., even-odddecomposition, but we disclose a novel approach for extending it to n-Dcase utilizing signal sampling concept, not limiting ourselves to acomplexity reduction by decomposition and factorization as in the priorart. We start from a simpler 2-D case and generalize to then-dimensional case. In the 2-D case, even-odd decomposition of samplingindexes results in the following 4 components as;

$\begin{matrix}{{x\left( {n_{1},n_{2}} \right)} = {{x^{00}\left( {n_{1},n_{2}} \right)} + {x^{01}\left( {n_{1},n_{2}} \right)} + {x^{10}\left( {n_{1},n_{2}} \right)} + {x^{11}\left( {n_{1},n_{2}} \right)}}} & (1)\end{matrix}$ where $\begin{matrix}{{x^{00}\left( {n_{1},n_{2}} \right)} = \left\{ \begin{matrix}{{x\left( {n_{1},n_{2}} \right)},} & {\left( {n_{1},n_{2}} \right) = \left( {{even},{even}} \right)} \\{0,} & {otherwise}\end{matrix} \right.} & (2)\end{matrix}$${x^{01}\left( {n_{1},n_{2}} \right)} = \left\{ \begin{matrix}{{x\left( {n_{1},n_{2}} \right)},} & {\left( {n_{1},n_{2}} \right) = \left( {{even},{odd}} \right)} \\{0,} & {otherwise}\end{matrix} \right.$${x^{10}\left( {n_{1},n_{2}} \right)} = \left\{ \begin{matrix}{{x\left( {n_{1},n_{2}} \right)},} & {\left( {n_{1},n_{2}} \right) = \left( {{odd},{even}} \right)} \\{0,} & {otherwise}\end{matrix} \right.$${x^{11}\left( {n_{1},n_{2}} \right)} = \left\{ \begin{matrix}{{x\left( {n_{1},n_{2}} \right)},} & {\left( {n_{1},n_{2}} \right) = \left( {{odd},{odd}} \right)} \\{0,} & {otherwise}\end{matrix} \right.$

This decomposition is different from the Cooley-Tukey's even-odddecomposition setting aside the dimensionality. The subsampledcomponents have the same lengths as the original signal since theymaintain the original sampling interval with replaced zero samplevalues. The discrete Fourier transform on the both sides of Eq. (1)gives,

X ₀ =X ₀ ⁰⁰ +X ₀ ⁰¹ +X ₀ ¹⁰ +X ₀ ¹¹  (3)

where the capital letters represent the DFT of the input data; X₀⇔x(n₁,n₂), X₀ ^(ij)⇔x₀ ^(ij)⇔x₀ ^(ij)(n₁, n₂), where i, j=0,1. The subscript 0was used to represent the original sampling domain which is the inputdata sampling domain. Define a new signal with a smaller signal supportin the 2-D space from the subsampled signals above by completelyremoving the zeros outside of the signal indexes given by the even-odddecomposition. For example, a shorter signal x₁ ^(ij)(n₁, n₂) is definedfrom the subsampled signal in the original sampling domain as follows;

$\begin{matrix}{{{x_{1}^{ij}\left( {n_{1},n_{2}} \right)} = {x\left( {{{2n_{1}} + i},{{2n_{2}} + j}} \right)}},i,{j = 0},1,} & (4)\end{matrix}$${n_{1} = 0},1,\ldots,\frac{N_{1}}{2},{n_{42} = 0},1,\ldots,\frac{N_{2}}{2}$

A subscript “1” is used to denote the new decimated domain. The DFT ofthe subsampled components in the original sampling domain is written interms of the DFT in the decimated domain as:

$\begin{matrix}{{X_{0}^{ij} \equiv {X_{0}^{ij}\left( {k_{1},k_{2}} \right)}} = {{W_{N_{1}}^{ik_{1}}W_{N_{2}}^{jk_{2}}{\sum_{n_{2} = 0}^{\frac{N_{2}}{2} - 1}{\sum_{n_{1} = 0}^{\frac{N_{1}}{2} - 1}{{x_{1}^{ij}\left( {n_{1},n_{2}} \right)}W_{N_{1}/2}^{n_{1}k_{1}}W_{N_{2}/2}^{n_{2}k_{2}}}}}} = {W_{N_{1}}^{ik_{1}}W_{N_{2}}^{jk_{2}}{X_{1}^{ij}\left( {k_{1},k_{2}} \right)}}}} & (5)\end{matrix}$ whereX₁^(ij) ⇔ x₁^(ij)(n₁, n₂).

The DFTs of x₁ ^(ij)(n₁,n₂) in the decimated domain in Eq. (4) arerelated to the original DFT X₀ by

$\begin{matrix}{{\begin{bmatrix}{X_{0}\left( {k_{1},k_{2}} \right)} \\{X_{0}\left( {{k_{1} + \frac{N_{1}}{2}},k_{2}} \right)} \\{X_{0}\left( {k_{1},{k_{2} + \frac{N_{2}}{2}}} \right)} \\{X_{0}\left( {{k_{1} + \frac{N_{1}}{2}},{k_{2} + \frac{N_{2}}{2}}} \right)}\end{bmatrix} = \begin{bmatrix}1 & W_{N_{2}}^{k_{2}} & W_{N_{1}}^{k_{1}} & {W_{N_{2}}^{k_{2}}W_{N_{1}}^{k_{1}}} \\1 & {- W_{N_{2}}^{k_{2}}} & W_{N_{1}}^{k_{1}} & {{- W_{N_{2}}^{k_{2}}}W_{N_{1}}^{k_{1}}} \\1 & W_{N_{2}}^{k_{2}} & {- W_{N_{1}}^{k_{1}}} & {{- W_{N_{2}}^{k_{2}}}W_{N_{1}}^{k_{1}}} \\1 & {- W_{N_{2}}^{k_{2}}} & {- W_{N_{1}}^{k_{1}}} & {W_{N_{2}}^{k_{2}}W_{N_{1}}^{k_{1}}}\end{bmatrix}}\text{ }\begin{bmatrix}\begin{matrix}\begin{matrix}{X_{1}^{00}\left( {k_{1},k_{2}} \right)} \\{X_{1}^{01}\left( {k_{1},k_{2}} \right)}\end{matrix} \\{X_{1}^{10}\left( {k_{1},k_{2}} \right)}\end{matrix} \\{X_{1}^{11}\left( {k_{1},k_{2}} \right)}\end{bmatrix}} & (6)\end{matrix}$ wherek₁ = 0, 1, …, N₁/2, andk₂ = 0, 1, …, N₂/2.

The matrix in Eq. (6) is denoted as T₄ and it can be factored asfollows.

$\begin{matrix}{T_{4} = {\begin{bmatrix}1 & W_{N_{2}}^{k_{2}} & W_{N_{1}}^{k_{1}} & {W_{N_{2}}^{k_{2}}W_{N_{1}}^{k_{1}}} \\1 & {- W_{N_{2}}^{k_{2}}} & W_{N_{1}}^{k_{1}} & {{- W_{N_{2}}^{k_{2}}}W_{N_{1}}^{k_{1}}} \\1 & W_{N_{2}}^{k_{2}} & {- W_{N_{1}}^{k_{1}}} & {{- W_{N_{2}}^{k_{2}}}W_{N_{1}}^{k_{1}}} \\1 & {- W_{N_{2}}^{k_{2}}} & {- W_{N_{1}}^{k_{1}}} & {W_{N_{2}}^{k_{2}}W_{N_{1}}^{k_{1}}}\end{bmatrix} = {\left( {\begin{bmatrix}1 & 1 \\1 & {- 1}\end{bmatrix}\begin{bmatrix}1 & 0 \\0 & W_{N_{1}}^{k_{1}}\end{bmatrix}} \right) \otimes \left( {\begin{bmatrix}1 & 1 \\1 & {- 1}\end{bmatrix}\begin{bmatrix}1 & {01} \\0 & W_{N_{2}}^{k_{2}}\end{bmatrix}} \right)}}} & (7)\end{matrix}$

where the symbol ⊗ represents the tensor product, defined for a 2×2matrix X and another matrix Y as follows;

$\begin{matrix}{{X \otimes Y} = {{\begin{bmatrix}x_{11} & x_{12} \\x_{21} & x_{22}\end{bmatrix} \otimes Y} = \begin{bmatrix}{x_{11}Y} & {x_{12}Y} \\{x_{21}Y} & {x_{22}Y}\end{bmatrix}}} & (8)\end{matrix}$

Eq. (7) can be written simply as

T ₄=(BW _(N) ₁ _(k) ₁ )⊗(BW _(N) ₂ _(k) ₂ )∝T _(2×2)   (9)

with definitions

$\begin{matrix}{{B = \begin{bmatrix}1 & 1 \\1 & {- 1}\end{bmatrix}},{W_{N,k} = \begin{bmatrix}1 & 0 \\0 & W_{N}^{k}\end{bmatrix}}} & (10)\end{matrix}$

and the matrix T_(2×2) represents T₄ in a factorized form. The matrixB_(N)(k)∝BW_(N,k) represents a butterfly operation.

This allows for a more efficient computation of Eq. (6) in two steps asfollows:

Step 1: Butterfly operations in k₁ direction:

$\begin{matrix}{\begin{bmatrix}a \\b\end{bmatrix} = {{B_{N_{1}}\left( k_{1} \right)}\begin{bmatrix}{X_{1}^{00}\left( {k_{1},k_{2}} \right)} \\{X_{1}^{10}\left( {k_{1},k_{2}} \right)}\end{bmatrix}}} & (11)\end{matrix}$ and $\begin{matrix}{\begin{bmatrix}c \\d\end{bmatrix} = {{B_{N_{1}}\left( k_{1} \right)}\begin{bmatrix}{X_{1}^{01}\left( {k_{1},k_{2}} \right)} \\{X_{1}^{11}\left( {k_{1},k_{2}} \right)}\end{bmatrix}}} & (12)\end{matrix}$

Step 2: Butterfly operations in k₂ direction:

$\begin{matrix}{\begin{bmatrix}{X_{0}\left( {k_{1},k_{2}} \right)} \\{X_{0}\left( {k_{1},{k_{2} + {N_{2}/2}}} \right.}\end{bmatrix} = {{B_{N_{2}}\left( k_{2} \right)}\begin{bmatrix}a \\c\end{bmatrix}}} & (13)\end{matrix}$ and $\begin{matrix}{\begin{bmatrix}{X_{0}\left( {{k_{1} + {N_{1}/2}},k_{2}} \right)} \\{X_{0}\left( {{k_{1}N_{1}/2},{k_{2} + {N_{2}/2}}} \right.}\end{bmatrix} = {{B_{N_{2}}\left( k_{2} \right)}\begin{bmatrix}b \\d\end{bmatrix}}} & (14)\end{matrix}$

This decomposition allows a reduced complexity calculation, and its dataflow is depicted in FIG. 1 -c and FIG. 1 -d. It is possible to changethe order of butterfly operations, if necessary, with properrearrangement of input vector elements. The T_(2×2) matrix represents a2-D basic computation block.

Radix-4 decomposition: A similar development is made by using thedecimation-by-four for a 2-D signal, x(n₁, n₂), n₁=0, 1, . . . , N₁−1,n₂=0, 1, . . . , N₂−1, wherein we assume the size in each dimension ispower of 4, i.e., N_(s)=4^(M) ^(s) , M_(i) positive integers, with s=0,1.

x(n ₁ ,n ₂)=Σ_(i,j=0,1,2,3) x ₀ ^(i,j)(n ₁ ,n ₂)  (15)

The sub-sampled components are given by;

$\begin{matrix}{{x_{0}^{i,j}\left( {n_{1},n_{2}} \right)} = \left\{ \begin{matrix}{{x\left( {n_{1},n_{2}} \right)},} & \begin{matrix}{{n_{1} = {{4n_{1}^{\prime}} + i}},{n_{2} = {{4n_{2}^{\prime}} + j}},{n_{1}^{\prime} =}} \\{{0,1},\ldots,{\frac{N_{1}}{4} - 1},{n_{2}^{\prime} = {0,1}},\ldots,{\frac{N_{2}}{4} - 1}}\end{matrix} \\{0,} & {otherwise}\end{matrix} \right.} & (16)\end{matrix}$

Performing DFT on both sides of Eq. (15) the DFT of the input signal isrepresented as a sum of the DFTs of subsampled components x₀ ^(i,j)(n₁,n₂), i,j=0, 1, 2, 3, as follows

X ₀ =X ₀ ⁰⁰ +X ₀ ⁰¹ +X ₀ ⁰² +S ₀ ⁰³ +S ₀ ¹⁰ + . . . +X ₀ ³⁰ +X ₀ ³¹ +X ₀³² +X ₀ ³³  (17)

where X₀⇔x(n₁, n₂), X₀ ^(i,j)⇔x₀ ^(i,j)(n₁,n₂), i,j=0, 1, 2, 3.

Using the definition of signals in the decimated domain as,

$\begin{matrix}{{{x_{1}^{ij}\left( {n_{1},n_{2}} \right)} = {x\left( {{{4n_{1}} + i},{{4n_{2}} + j}} \right)}},i,{j = {0,1,2,3}}} & (18)\end{matrix}$${n_{1} = {0,1}},\ldots,{\frac{N_{1}}{4} - 1},{n_{2} = {0,1}},\ldots,{\frac{N_{2}}{4} - 1}$

the DFT components in Eq. (17) can be written as;

$\begin{matrix}{{X_{0}^{ij} \equiv {X_{0}^{ij}\left( {k_{1},k_{2}} \right)}} = {{W_{N_{1}}^{ik_{1}}W_{N_{2}}^{jk_{2}}{\sum_{n_{2} = 0}^{\frac{N_{2}}{4} - 1}{\sum_{n_{1} = 0}^{\frac{N_{1}}{4} - 1}{{x_{1}^{ij}\left( {n_{1},n_{2}} \right)}W_{N_{1}/4}^{n_{1}k_{1}}W_{N_{2}/4}^{n_{2}k_{2}}}}}} = {W_{N_{1}}^{ik_{1}}W_{N_{2}}^{jk_{2}}{X_{1}^{ij}\left( {k_{1},k_{2}} \right)}}}} & (19)\end{matrix}$ whereX₁^(ij)(k₁, k₂) ⇔ x₁^(ij)(n₁, n₂).

The overall relationship between the original DFT and the DFT componentsin the decimated domain is given by

$\begin{matrix}{\begin{bmatrix}{X_{0}\left( {k_{1},k_{2}} \right)} \\{X_{0}\left( {k_{1},{k_{2} + \frac{N_{2}}{4}}} \right)} \\{X_{0}\left( {k_{1},{k_{2} + \frac{N_{2}}{2}}} \right)} \\ \vdots \\{X_{0}\left( {{k_{1} + \frac{3N_{1}}{4}},{k_{2} + \frac{3N_{2}}{4}}} \right)}\end{bmatrix} = {T_{16}\begin{bmatrix}\begin{matrix}\begin{matrix}{X_{1}^{00}\left( {k_{1},k_{2}} \right)} \\{X_{1}^{01}\left( {k_{1},k_{2}} \right)}\end{matrix} \\{X_{1}^{02}\left( {k_{1},k_{2}} \right)} \\ \vdots \end{matrix} \\{X_{1}^{33}\left( {k_{1},k_{2}} \right)}\end{bmatrix}}} & (20)\end{matrix}$ where $\begin{matrix}{T_{16} = {T_{4 \times 4} - {\left( {\begin{bmatrix}1 & 1 & 1 & 1 \\1 & {- j} & {- 1} & j \\1 & {- 1} & 1 & {- 1} \\1 & j & {- 1} & {- j}\end{bmatrix}\begin{bmatrix}\begin{matrix}1 & 0 \\0 & W_{N_{1}}^{k_{1}}\end{matrix} & \begin{matrix}0 & 0 \\0 & 0\end{matrix} \\\begin{matrix}0 & 0 \\0 & 0\end{matrix} & \begin{matrix}W_{N_{1}}^{2k_{1}} & 0 \\0 & W_{N_{1}}^{3k_{1}}\end{matrix}\end{bmatrix}} \right) \otimes \left( {\begin{bmatrix}1 & 1 & 1 & 1 \\1 & {- j} & {- 1} & j \\1 & {- 1} & 1 & {- 1} \\1 & j & {- 1} & {- j}\end{bmatrix}\begin{bmatrix}\begin{matrix}1 & 0 \\0 & W_{N_{2}}^{k_{2}}\end{matrix} & \begin{matrix}0 & 0 \\0 & 0\end{matrix} \\\begin{matrix}0 & 0 \\0 & 0\end{matrix} & \begin{matrix}W_{N_{2}}^{2k_{2}} & 0 \\0 & W_{N_{2}}^{3k_{2}}\end{matrix}\end{bmatrix}} \right)}}} & (21)\end{matrix}$

Eq. (21) can be written as

T _(4×4)=(QW _(N) ₁ _(,k) ₁ )(QW _(N) ₂ _(,k) ₂ )  (22)

with definitions

$\begin{matrix}{{Q = \begin{bmatrix}1 & 1 & 1 & 1 \\1 & {- j} & {- 1} & j \\1 & {- 1} & 1 & {- 1} \\1 & j & {- 1} & {- j}\end{bmatrix}},{{{and}W_{n,k}} = \begin{bmatrix}\begin{matrix}1 & 0 \\0 & W_{N}^{k}\end{matrix} & \begin{matrix}0 & 0 \\0 & 0\end{matrix} \\\begin{matrix}0 & 0 \\0 & 0\end{matrix} & \begin{matrix}W_{N}^{2k} & 0 \\0 & W_{N}^{3k}\end{matrix}\end{bmatrix}}} & (23)\end{matrix}$

The matrix Q_(N)(k)∝QW_(N,k) represents a 1-D quad-fly operation for thecomputation of transform size N from the previous transform of size ofN/4 on a given axis. The matrix T_(4×4) represents a 2-D quad-fly.

The reduction of the DFT sizes of the components at each stage isachieved either by butterfly or quad-fly, depending on even-odddecomposition or radix-4 decomposition. The reduction process isrepeated until the final 2-point or 4-point DFT size is reached on allaxis for a given transform size N=2m, m a positive integer.

A similar decomposition technique can be applied to a 3-dimensional datax(n₁, n₂, n₃), n₁=0,1, . . . , N₁−1, n₂=0,1, . . . , N₂−1, n₃=0,1, . . ., N₃−1. The process is identical except the fact that the dimensionalityhas increased, and as a result, the number of components increased aftereven-odd decomposition or radix-4 decompositions. For example, even-odddecomposition of a 3-D signal would generate a 8×8 T matrix since thereare 8 components after the decomposition and the T matrix is representedas;

T _(2×2×2)=(BW _(N) ₁ _(k) ₁ )⊗(BW _(N) ₂ _(k) ₂ )(BW _(N) ₃ _(k) ₃)  (24)

FIG. 6 shows the data flow for the computation of Eq. (24).

Similarly, radix-4 decomposition of a 3-D signal would generate a 64×64T matrix as;

T _(4×4×4)=(QW _(N) ₁ _(k) ₁ )⊗(QW _(N) ₂ _(k) ₂ )⊗(QW _(N) ₃ _(k) ₃)  (25)

In general, radix4 decomposition of a n-dimensional signal would give a4^(n)×4^(n) T matrix as;

T _(4×4× . . . ×4)=(QW _(N) ₁ _(k) ₁ )⊗(QW _(N) ₂ _(k) ₂ )⊗ . . . ⊗(QW_(N) _(n) _(k) _(n) )

In an exemplary embodiment of 256×256 2-D FFT, the input data are storedin the input buffer in a radix-4 reversed manner. The radix-4 reversalis achieved by the address mapping: from address-in=b₇b₆b₅b₄b₃b₂b₁b₀ toaddress-out=(b₁b₀)(b₃b₂)(b₅b₄)(b₇b₆). The address mapping is applied toboth row and column addresses respectively, as shown by blocks 400 and401 in FIG. 4 -a.

Within each 2-D quad-fly computations, row and column direction 1-Dquad-fly computations are performed in sequential manner as shown inFIG. 2 -b or FIG. 2 -c. The results are stored back in the input bufferat the same 4×4 location within the memory block 402. All the quad-fliescan be computed independently since no data dependency exists amongstquad-fly computations. The 4×4 block 511 at the top-left of the Stage-1output in FIG. 4 -b represent the inputs and outputs of the first 2-Dquad-fly in the memory buffer 402 in FIG. 4 -a. There are total of64×64=4096 of 2-D quad-fly inputs and outputs as indicated by suchblocks 511, 512 and 513, which can be computed all independently. AtStage-2, 16×16 2-D DFT computations are performed on 16×16 array of suchdata blocks. The first of such blocks, block 521 in FIG. 4 -b has 16×16data elements for a 16×16 2-D DFT computation. The small dark squares atthe top-left corner 522, 523, etc. of inner squares represent a 4×4input and 4×4 output data of the first 2-D quad-fly for the computationof 16×16 2-D DFT. The distances between dark squares, for example, 522and 523 are 4 and a total of 16 2-D quad-flies are used to process theblock 521.

All the 16 quad-flies compute independent of each other with its owninput and output data sets. The twiddle factors are computed accordingto the locations of the input data within the block 521, with the targetDFT size N=16 according to Eq. (23). Since there are 16×16 of suchcomputations and each computation has 16 independent 2-D quad-flycomputations, the total number of independent 2-D quad-fly computationsare still the same at (16×16)×16=4096.

At Stage-3, 64×64 2-D DFT computations are performed. The first of suchblocks, 531, has 64×64 data elements as inputs for a 64×64 2-D DFT. Thesmall dark squares, for example, 532 and 533 at the top-left corner ofinner squares of the block 531 represent a 4×4 input and 4×4 output fora 2-D quad-fly. The distances between adjacent dark squares are 16 and atotal of 16×16=256 2-D quad-fly computations inside the block 531. Allthe 256 quad-files compute independent of each other with its own inputsand outputs. The twiddle factors are computed according to the locationsof the input data within the block 531, with the target DFT size N=64according to Eq. (23). Since there are 4×4 of such computations and eachcomputation has 256 independent 2-D quad-fly computations, the totalnumber of independent 2-D quad-fly computations are still the same at(4×4)×256=4096.

At Stage-4, the final 256×256 2-D FFT computation is performed on thewhole data in the memory. The small dark squares, for example, 542 and543, at the top-left corner of inner squares of the block 531 representa 4×4 input and 4×4 output for a 2-D quad-fly. The distances betweenadjacent dark squares are 64. The total number of 2-D quad-flycomputations is 64×64=4096. The twiddle factors are computed accordingto the locations of the input data within the block 504, with the targetDFT size N=256 according to Eq. (23).

Another embodiment of the current invention is presented for higherthroughput computations. A pipelined implementation of an n-dimensionalFFT is disclosed in FIG. 5 . Instead of utilizing the same memory asinput buffers and output buffers for each stage of computations, adedicated ping-pong input buffer is used at each stage, from Stage-1 toStage-S, where the number of stages S is determined by the transformsize as, S=round-up (log 4(maximum (sizes of all axes)), as shown inFIG. 5 . Only the final stage, Stage-S has its own output buffer 605since the input ping-pong buffers at Stage-i are used as output buffersfor the previous stage, Stage-(i−1), i=2, 3, . . . , S. For example, theinput ping-pong buffer 603 is used as an output buffer for Stage-1. The2-D quad-fly array blocks 602 and 604, for example, can be implementedfor full parallel computations for the highest throughput. However, thenumber of parallel n-D quad-flies maybe limited by the number of memoryread/write ports and throughputs in blocks 601 and 602, etc. A goodtradeoff can be made by choosing a proper number of memory ports and thenumber of parallel 2-D quad-flies.

An example of 3-D butterfly is shown in FIG. 6 using a 1-D butterfly inFIG. 1 -a. In a similar way, a 3-D quad-fly can be implemented using 1-Dquad-fly in FIG. 2 -a. In fact, it can be generalized to n-dimensionalbutterfly and quad-flies using the equations (24) and (26).

What is claimed is:
 1. A method of computing n-dimensional FFTcomprising: reading input data, computing n-dimensional basiccomputation blocks, and generating transform result in the output bufferwithout transpose data flow and with minimum memory requirement, whereinn-dimensional basic computation blocks perform n-dimensional butterfliesor n-dimensional quad-flies or n-dimensional hybrid-flies for thedimension n greater than or equals to
 2. 2. The method of claim 1,wherein said computations are done in stages with increasing sizes ofn-dimensional FFTs until the desired n-dimensional FFT size is achieved.3. The method of claim 1, said n-dimensional butterfly, quad-fly andhybrid-fly computations are performed by one-dimensional butterflyand/or one-dimensional quad-fly computations in a sequential order foreach dimension.
 4. The method of claim 1, wherein n-dimensional basiccomputations are performed in a serial manner utilizing a singlen-dimensional basic computation block repeatedly.
 5. The method of claim1, wherein n-dimensional basic computations are performed in parallelutilizing a plurality of n-dimensional basic computation blocks.
 6. Themethod of claim 1, wherein n-dimensional basic computations areperformed in parallel using a combination of thread-parallel and/orhardware-parallel processing units with or without a central processingunit.
 7. The method of claim 2, wherein said computations in said stagesare done in-place without requiring an additional memory buffer fortranspose data flow.
 8. The method of claim 2, wherein said computationsin said stages are done in a pipelined manner with input buffer andoutput buffer for each said stage.
 9. The method of claim 8, whereinsaid input and output buffers are accessed in a pipelined and parallelmanner using multi-port ping-pong buffers.
 10. An apparatus forcomputing n-dimensional FFT comprising: an input buffer and an outputbuffer and a single or a plurality of n-dimensional basic computationblocks which perform n-dimensional butterflies, n-dimensional quad-fliesand n-dimensional hybrid-flies wherein integer n is greater than orequals to
 2. 11. The apparatus of claim 10, wherein said input bufferand output buffer are implemented using an identical memory block forin-place computation.
 12. The apparatus of claim 10, wherein saidcomputations are performed in stages with increasing n-dimensional FFTsizes, wherein each stage has its own input and output buffers forpipelined operations of all the stages.
 13. The apparatus of claim 10,wherein said single basic computation block is implemented in a CPUprogram and/or in FPGA and/or in custom circuits, and/or aprocessor-in-memory.
 14. The apparatus of claim 10, wherein said aplurality of n-dimensional basic computation blocks are implemented inCPU programs and/or in FPGA and/or in custom circuits, and/orprocessors-in-memory.
 15. The apparatus of claim 10, wherein said inputbuffer and output buffer are implemented using multi-port ping-pongmemory buffers for parallel and pipelined data access.